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Hierarchical Parallelism in Finite 
Difference Analysis of Heat Conduction 

Joseph Padovan , 1 Lala Krishna 1 and Douglas Gute 2 
The University of Akron, Akron, Ohio 

PART I - FORMULATION 
SUMMARY 

Based on the concept of hierarchical parallelism, this series of papers develops highly efficient 
parallel solution strategies for very large scale heat conduction problems. In addition to yielding 
a many order of magnitude improvement in computational speed, the methodology reduces round 
off as well as introduces a significant solution stabilization when used in conjunction with 
iterative procedures. Overall, the method of hierarchical parallelism involves the partitioning of 
thermal models into several substructured levels wherein an optimal balance in the various 
associated bandwidths is achieved. The solution to the problem is then developed in parallel via 
special direct, iterative or mixed (direct/iterative) procedures wherein each partition is monitored 
for its intrinsic spectral properties so as to enable the choice of the appropriate local solution 
algorithms. Overall, the paper is organized into to parts. The first develops the parallel 
modeling methodology and associated multilevel direct, iterative and mixed solution schemes. 
Part II establishes both the formal and computational properties of the scheme. Here emphasis is 
given to establishing convergence characteristics, spectral properties as well as the choice of the 
appropriate solution accelerators. 


1 Department of Mechanical Engineering , The Univerisity of Akron. 

2 Department of Mathematics, The University of Akron. 



1 . 


Introduction 


In recent years the almost unlimited promise afforded by the first 
several generations of sequential type computer architectures has essen- 
tially been saturated. Motivated by this, extensive ongoing work has 

been undertaken to develop new types of computer architectures 

[ 1 , 2 ] . 

These principally fall into the following categories , i.e.: 

i) Vectorized/pipelined systems; 

ii) Parallel systems; and, 

iii) Combined systems. 

Many of the new systems will have a hierarchy of operational modes. 
Namely, depending on the repetition level of a given block of code, three 
operational modes will be possible. Specifically, below a given repeti- 
tion level, the code will be performed sequentially. For an intermediate 
range, the coding block will be performed vectorially. At yet higher 
levels, the coding is copied to a set of parallel processors where it is 
performed locally in a multiply vectorized format. Such a many option 
scheme is essentially dependent on the intrinsic lower and upper bound 
performance characteristics of the associated scalar vector and parallel 
processors. Included in the decision making are the associated communi- 
cations costs. 

While such procedures have added to our current capacities, generally, 
the efficiencies have become saturated as vector lengths and/or the number 
of processors have grown too large. Prototypically , in parallel systems 
employing such schemes as the super cube technology running speeds 

tend asymptotically to be diminishing fractions of the number of process- 
ors employed Similarly, in vectorized systems generally vector length 
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is limited to certain lower and upper bound dimensions to yield optimal 
results. This limitation also reduces the full potential of such schemes. 

Beyond these restrictions, the general tendency today is to attempt 
to solve thermal simulation problems in a strictly global manner. As will 
be seen, such approaches tend to yield extremely large sets of equations 
with their concommitant bandwidth problems. In an attempt to bypass such 
difficulties typically some form of bandwidth minimization scheme is 
employed ^ . While such an approach yields some relief, even optimized 

global models represent significant difficulties. 

In the context of the foregoing, this series of papers will develop 
an alternative architectural strategy to handle very large scale thermal 
simulations. Specifically, a multilevel, i.e., hierarchical form of prob- 
lem partitioning/substructuring will be developed. As will be seen, such 
a scheme will enable so-called P artitional/substructural/local bandwidth 
minimization at the various hierarchical levels of the scheme. This will 
enable orders of magnitude improvement of the computational speed of finite 
difference and element (FD/'FE) simulations when separate processors are 
defined for each partition. Additionally, when used with direct solvers, 
the associated round off error is significantly reduced thereby lightening 
machine load. 

In addition to establishing the modelling architecture, several associ 
ated solution methodologies are developed. These include: 

i) Purely direct schemes; 

ii) Mixed direct and iterative procedures; and, 

iii) Purely iterative methodologies. 


For the mixed scheme, various partitions 


i.e. substructure, are handled 
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via either direct or iterative procedures. Such a treatment being con- 
tingent on the associated spectral conditioning of the partition. To 
generalize the development, a variety of schemes are considered and modi- 
fied, i.e. Jacobi Gauss-Seidel t65 , successive overrelaxation (SOR) 

161 and conjugate gradient t7j . As will be seen, the hierarchicalism 
tends to introduce significant improvements in the stability and efficiency 
of iterative schemes. 

Beyond the purely developmental aspects, formal numerical properties 
will also be given. This will include such items as convergence charac- 
teristics, spectral properties, relative efficiencies of the various schemes, 
as well as the selection of various optimizing parameters for the SOR and 
conjugate gradient methodologies. 

To prove out the scheme, the results of a variety of benchmark ex- 
amples will be discussed. Overall these include moderate to large scale 
heat conduction problems ranging in size from models of 10,000 to 250,000 
degrees of freedom. 

Overall the two-part series of papers is organized as follows: 

i) Part I 

1.1 Overview of previous work. 

• FD analysis of heat conduction 

• New computer environments 

1.2 Current Approach 

• Substructural parallelism 

• Local bandwidth minimization 

1.3 Algorithmic Considerations 
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ii) Part II 

II. 1 Algorithmic overview 

XX. 2 Formal numerical properties 

• Convergence characteristics 

• Spectral properties 

, • Relative convergence properties 

• Selection of optimal acceleration parameters 

• Influence of hierarchicalism 

11 . 3 Benchmarking including heat conduction problems. 

11. 4 Final summary. 
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2 . Previous Work 

As noted in the introduction, FD schemes are widely employed in heat 
transfer analysis. Most of such applications have involved the use of 
sequential and at best pipe lined/vectorized computers, i.e. CRA'r and FPS 
type systems ^ ^ » To motivate the development of the hierarchical parallel 
scheme noted earlier in the Introduction, this section will 


i) Briefly overview FD modelling procedures 

ii) Describe new computer environments and their impact on FD analy- 
sis; and, 

iii) Overview shortcomings of proposed computer configurations, i.e. 
pipe lined/vectorized and parallel type systems. 


2.1 Overview of FD Analysis 

Assuming isotropic media for simplicity, the governing Fourier heat 

[8 ] 

conduction relation takes the form 


V 2 T + Q = 0 (2.1) 

. . [ 8 ] 

wherein 

V 2 ( ) = sL ( ) + ( ) (2.2) 

9x^ 3x^ 3*2 

such that T is temperature and x^, are the Cartesian coordinates. 

The boundary conditions associated with (2.1) are either prescribed temp- 

_ [ 8 ] 

erature, flux or of the convection-radiation type, namely 


i) for V x e 3 R t 
T = T 


ii) for V x t 3R^ 




II 

3x 


\ = 


(2.3) 


(2.4) 
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iii) for V x e 3R 


cr 


< n. = h(t-t ) + y(t 4 -t 4 ) 

3x. i 00 00 

1 


(2.5) 


where T, q, are the prescribed temperature and heat flux components, and 
Q, k, H, T , y, 3R^,, 3Rq, 3R c „ are respectively the internal heat genera- 
tion thermal conductivity, convection coefficient, ambient temperature, 
radiation coefficient, and the surface areas defining prescribed tempera- 

ture, flux and convection-radiation. 

For demonstration purposes, (2.1) will be simulated via 5-point 2D 

[ 6 ] 

and 7-point 3D FD expressions. In this context as noted by Varga , 
(2.1) and its associated boundary conditions, i.e. (2.3 - 2.5) can be 
converted into the following FD formulation, namely 


[ K] T = Q 


( 2 . 6 ) 


ere T defines the (N,l) column vector of mesh point temperature, Q the 

V N,1) column vector defining internal heat generation and boundary condi 

tion effects (purely linear) and lastly IK] is an (N,N) five diagonal 

(2-D case) matrix. Since [ K] is generally a positive definite symmetric 

matrix it fits in the class of matrices termed Stieltjean . As will . 

be seen in Part II, this property will aid us to establish the convergence 

properties of the parallel solution strategy. 

Note depending on problem size, the bandwidth of |K] will change 

accordingly. Furthermore, contingent on the boundary conditions and 

connectivity of the problem, the bandwidth need not be uniform. Generally 

( 10 ] 

(2.6) is solved either iteratively or directly via either a skylined 
or frontal scheme For extremely large scale problems, typically 
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[12 ] 

some form of out of core blocking is required to handle the direct 

inverse. Generally even with the use of say solid state disks, this re- 
quires extensive amounts of CPU (central processor unit) and real clock 
times. Hence, large scale problems with even modest average 
bandwidths tend to tax mainframe capacities. This is one of the primary 
factors motivating the ongoing thrust to develop alternative computer 
architectures. 

2.2 New Computer Environments 

Currently there appears to be essentially two major forms of new 
computer architecture, i.e.: 

i) Pipe lined/vectorized; and, 
ii) Purely parallel 

For the pipe lined and vectorized machines, while the overall problem is 
still solved in an essentially segmential sense, similar noninterdependent 
operations can be performed in vectorized chunks, i.e. matrix multiplica- 
tion, subtraction, addition, etc. The vector/pipelined architectures 
involve the use of numerous individual processors all preprogrammed for 
certain fixed duties. In this context, the size of the chunks operated 
on are contingent on the number of available individual processors and 
associated memories. 

In the case of so-called parallel systems, the overall machine archi- 
tecture consists of several to numerous individual processors each with 
its own distinct I/O and instruction set capacities. Currently the cross- 
talk between the various separate processors is achieved via various con- 
voluted connection schemes, i.e. the super cube methodologies used in 

r 4 ] 

the FPS 100 series , the connection machine supported by DAHPA, etc. 
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Each of the various foregoing technologies f while providing signifi- 
cant improvements over the traditional single processor frames nonetheless 
themselves possess very important shortcomings . Organized according 
to architectural type, these include: 

i) Pipelined/vector architectures 

• Extensive memory requirements in main CPU; 

• Very awkward to program; 

• Difficult to employ in multi-user environment; 

• Difficult to arrange programing to efficiently handle prob- 
lems with multitudinous substructure, i.e. material groups, 
boundary conditions, complex boundaries; 

• Code structure too dependent on machine architecture to 
be economically viable; 

• Many operations required at local level are scalar, hence 
slowing down overall throughput; 

• I/O awkward due to multitudinous degrees of freedom; 

• particularly awkward to provide for mesh refinement, namely 
wherein any portion of the model may be densified, all 
storage and array alignments must be reconfigured during 
refinement process; and, 

• Concentration of all computing power in one machine is not cost 
effective except for large well funded government research 
installations. 

ii) Parallel machines 

• Awkward to program; 

• Communication between parallel processors usually awkward, 
i.e. the super cube; 
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• Controlling languages highly machine dependent; coding 
lacks portability; 

• I/O typically awkward; 

• Compute power of given node usually limited; and, 

• Parallel architectural methodologies lacking; 

Much of the foregoing problems stem from the fact that to date 
machine architecture has been dictated by hardware-software considera- 
tions as well as the needs of the prevailing bulk user, i.e. the service 
industry. Because of this, the natural generic features of the physics 
of an engineering problem and its associated analytical numerical formu- 
lation must be subverted to satisfy the needs dictated by machine capa- 
bilities. The next section will establish a more natural connection 
scheme. 
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Current Approach 


For FD simulations such as defined by (2.6), the overall computer 
load falls into two main categories: 

(1) The generation of mesh point equations and their associated 
global assembly, and 

(2) The solution of (2.6) with its concommitant large bandwidth. 

In the case of large scale simulations, the use of either direct (sky- 
line, frontal or iterative (least square * Jacobi, Gauss- 

Seidel, SOR, preconditioned conjugate gradient PCG) schemes tends to 
tax even the largest main frames whether pipelined, vectorized, or parallel, 
i.e. the CRAY-XMP, vectorized IBM-3090-400, FFS-200, etc. In this con- 
text, it follows that as currently envisioned and configured, neither 
vectorized/parallelized nor straight parallel systems provide the total 
answer to improved computer architectures. 

From a purely philosophical point of view, the actions and reactions 
in real physical situations whether steady state or transient occur in 
an essentially concurrent format. Such behavior is modelled by the equa- 
tions of continuum mechanics. Because of this , the ideal computer archi- 
tecture should be able to simulate such behavior in its own architecture, 
i.e.: 

i) To be able to handle steady problems wherein all the various 
system components are interacting; and, 

ii) To handle transient wave front problems wherein the zones of 

concurrent interaction grow as the waves signalling the infor- 
mation flow spread. 
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Such modelling requirements point to concurrent parallel processors 
whose individual capacities are enhanced by pipelined/vectorized attri- 
butes * 

From a purely industrial point of view, as noted earlier, a single 
concentrated source of computing would not be economically feasible. A 
corporations prototypical overall organization is divided into separate 
departments designing/developing/analyzing various components of a given 
product. Hence; they represent a multiple computer user base. 
Generally computer requirements fall into several main categories, 
namely: 

i) To design/develop/analyze individual product components; 

ii) To provide a common data base for the various interconnecting 
product components; and, 

iii) On occassion to enable running highly refined component models 
or overall simulations of the entire product. 

Such industrial/institutional organizational schemes place competing 
somewhat contradictory requirements on the computing facilities, i.e.: 

i) The need for localized distributed computing; 

ii) Data base networking; and, 

iii) Significant central processing capability. 

Interestingly, these needs are not unlike those required by a particu- 
larly large scale FD simulation, in this case, one involving heat con- 
duction. 

3.1 Substructural Parallelism 

Based on the foregoing, a so-called hierarchically parallel 
methodology and associated direct and iterative solution strategy will be 
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developed. Specifically, the overall problem will be partitioned into 
a set of substructures each acting as separate conduction problems. 
These partitions constitute the first hierarchical level. Note, each 
of the substructural problems will be provided with the appropriate 
interlinking boundary conditions to provide the proper global conserva- 
tion of energy and field variable continuity. 

Noting Fig. 3.1, each separate partition will have its own set 
of internal and boundary/external mesh points. Hence, (2.6) 
takes the form 

= Q l ; u [l.L] (3.1) 


where L is the number of substructure. 



and [K ] is partitioned in the form 


[K 2 -] = 


"'A 1 

Ck ie j 


1^3 


(3.2) 


(3.3) 


(3.4) 


such that the superscript i and the subscripts I, E, and IE/EI respectively 
define the partition number, internal, external and connection blocks of 

The mesh points lying on the boundaries of a given substructure may 
themselves fall- into two categories, i.e.: 
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i) Those boundary mesh points shared by two distinct sub-domains; 
and, 

ii) Those shared by several, i.e. see Fig. 3.1. 

Based on the segregation of variables into internal and those on dual 
and multiple boundaries, we see that (3.1) can be recast in the form: 


i) Internal Regions; 

l I I -I 

I?* + ^I k DB^-DB + C I K MB^-MB 


ii) Dual Boundaries; 

( ]T*" = 

l DB K DB-DB 

DB-^ + ( MB K DB ] -I + ^DB^W-MB 


(3.6) 


iii) Multiple Boundaries; 


^MB^MB^ -MB 

MB-^ + ^MB K DB^-DB + [ MB K I ] -I 


(3.7) 


such that here: 

[ j^] - Conductivity matrix of substructure 

[ K*” ] " Coupling block between £ th interior variables and £ th 
I DB 

dual boundary variables 

- Coupling block between interior and multiple boundary 
variables 
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1 - Conductivity matrix of dual boundary variables 


r K 2 1 - CouDling block between dual and multiple boundary variables 
l DB'MB J * 6 

^MB*4b^ ” Conductivit y matrix multiple boundary points 

and T^, and respectively are the interior, dual boundary and 

£ 

multiple boundary temperatures. Due to the structure of [K ] and its 
various partitions, it follows that 


^I%B^ " ^DB K I J ' 


(3.8) 


= W K i^' 


(3.9) 


r v 2- l = r v 2, i 1 

1 db^mb 1 1 mb k db j 


(3.10) 


where ( )' designates matrix transposition. 

Note, depending on the FD operator employed, i.e. 5, 7, 9, 13 point 
or higher order, the various off diagonal blocks [^K^ 2 ],... will have 
different degrees of coupling. For instance, in the case of low level 
operator (5, 7 point), the structure of fjKjjgl**** be essentially 

empty hence yielding a so called sparse coupling. In contrast, for 
higher order FD operators the structure of will be quite 

complex hence yielding a dense coupling. Unlike the FD, use of FE analysis 
will prototypically yield a dense coupling between levels. 

After direct assembly, (3.5) - (3.7) yields the following global 
second level relations, namely: 

Z(Eq. (3.5) ) 
i ‘ 

Wh-hWlDB-hWlMB-®! <3 - n) 
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(3.12) 


H (Eq. (3.6)) 
l 

^DB K DB^-DB " ^DB K I^-I ' ^DB^B^MB = -DB 


S (Eq. (3.7)) 

Z 1 


[ MB%B ] -MB ‘ [ MB K DB ] IdB " ^MB K I 3 -I -MB 


(3.13) 


where for. example [^K^] is a block diagonal matrix, that is 


r K ] = 

L i i J 




[ 0 ] 


to] 






(3.14) 


Furthermore, Tj takes the following partitioned form 


r 





(3.15) 


The size of the various partitions is given by 
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w - 

(N x , N x ) 

C I K DB^ 

" < N r V 

[ I%B ] 

(N I’ n mb^ 

^DB*DB^ 

^ n db’ n db 

f DB%B^ 

(n db* n mb 

^MB^MB^ 

< ' N MB’ n mb 


(3.16) 


where N T , N-,, and N.._ respectively denote the number of total internal, 
I DB MB 

dual and multiple boundary mesh points. Note if N is the total number 
of degrees of freedom, then 


N = N + N + N 
W W I DB M 


(3.17) 


where 


*1 - * n i 


(3.18) 


such that Nj is the number of internal mesh points of the £ substructure. 

Recast in global matrix form, (3.11) - (3.13) yields the expres- 
sion 


[KjjlT - Q 


Q + CtKyJ + [K l J)T 


(3.19) 


where 


tK D 3 = 


“w 

[0] 

[0] - 

[0] 

^DB^DB^ 

[0] 

_ [0] 

[0] 

^MB*W - 


(3.20) 
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[Kjj] + tK L ] 


" (0] 


[ i w 

[ DB^ 

[0] 

[ db k mb 

- *-MB V 

[ mbW ^ 


such that 

IV = 


"[0] 


[ I K M3 J 

[0] 

[0] 

^db k mb 

[0] 

[0] 

[0] 


(3.21) 


(3.22) 


[K L ] - [Ky] 1 ‘ (3.23) 

As can be seen from (3.20), [K^] is a block diagonal matrix. The matrices 
fKy] and (K^l are respectively of upper and lower triangular form with 
zero blocked diagonals. 

Due to its form, (3.19) provides a two level, i.e. hierarchically 
parallel organization to the governing equations. It can be employed 
in several ways, namely: 

1. If all the internal variables are eliminated, then the extern- 
al nodes appearing on the various substructure tend to con- 
vert them into super elements, i.e. as per finite element 
analysis; note such a procedure would be undertaken prior to 
assembly and performed simultaneously, i.e. in parallel; 
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since it only retains external variables, the resulting 
assembled global formulation would be significantly reduced 
in size and hence more manageable; 

2. Equation (3.19) could be used to establish three basic types 
of solution algorithms, i.e. 

• Direct elimination both locally and globally; 

• Enable a mixed procedure wherein the various partitions 
are handled either iteratively or directly; and, 

• Both levels of the formulation are handled iteratively; 
lastly 

3. The hierarchical parallelism can be carried out in general 
global-local levels which enable bandwidth minimization on a 
local rather than a full problem basis. Fig. 3.2. 

Note, due to the restructuring of (2.6), (3.19) is in P-cyclic 
form for five and seven point FD simulations. Specifically, noting 
(3.20) and (3.21), it follows from Varga that (3.19) is weakly 2- 
cyclic. As will be seen in Part II, this property will enable us to 
establish a very well defined range of choices for parameters to acceler- 
ate the successive overrelaxation (SOR) scheme ^ ^ . 

In a similar context, for FD simulations the various matrices appear- 
ing in (3.19) are Stieltjean Hence, the powerful results of M-matrix 

theory ^ will enable us to establish formal considerations which define 
the convergence properties of a wide variety of iterative schemes, i.e. 
Jacobi, Gauss Siedel, SOR, as well as preconditioned conjugate gradient 
schemes. These considerations will also enable comparisons between 
variations of the foregoing techniques. 
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3.2 Local Bandwidth Minimization 


For the foregoing two level partitioning process, the appropriate 
choice of the number of substructure can yield optimal results. As an 
example, noting Fig. 3.3, consider the case wherein separate processors 
and dedicated to the individual local substructure as well as to the 
assembled global level external formulation. While the optimal balance 
between the requisite number of processors and simulation size is problem 
dependent, as will be seen from the following example, the central control- 
ling factor is the balance between first and second level bandwidth mini- 
mization. In particular, consider the 2-D rectangular uniformly differ- 
enced domain defined in Fig. 3.4. For simplicity, the region will be 
partitioned into lsvel, i.e. local domains. In this context, 

if n . and n _ denote the total number of degrees of freedom per edge 
gl g- 

then it follows from Fig. 3.4 that 

n = K.(n.-l) + 1 (3.24) 

gl 11 

n^ 2 = K^Cn^-l) + 1 (3.25) 

where n^ and n^ denote the number of degrees of freedom along the edges 
of the local substructure. 

Based on (3.24) and (3.25), the total number of degrees of freedom, 
is given by the expression 


N = n . n 

g gl gl 

or in terms of n^ and we have that 

N = (^.(n,-!) + l)(< 7 (n_-l) + 1) 
g 1 i ^ ^ 

In the square case (<^ = , n^ = n 2 ) then 

N = U(n-l) + l) 2 

O 


(3.26) 


(3.27a) 


(3.27b) 
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For situations wherein (n,<)»l such that n»<, (3.27) reduces to 

Ng *x. < 1 ic^n 1 n 2 (rectangular) (3.28) 


0 "5 

Ng ~ (tc)“(n)~ (square) 


(3.29) 


After global assembly the minimum bandwidth associated with the non- 
substructured full formulation is given by the expression 

Bg = KjCn^l) + 2 (3 * 30) 

where here, ^(n^l) + 1 < < 2 (n 2 -l) + 1. For the square region, (3.30) 
reduces to 

Bg = x(n-l) + 2 

In the case that (ic,n)>>l; <<<n then 


(3.31) 


Bg •v „n < 3 ' 32 > 

Based on (3.26) - (3.32), the computational effort associated with 

the direct calculation of (2.6) takes the form 

Cg ^ | Ng ((Bg) 2 + Bg) (3.33) 

Considering the square and rectangular regions, we yield that 

Cg * \ (< 1 (n 1 -l)+l)(< 2 (n 2 -l)+l){(< 1 (n 1 -l)+D 2 + ^(n^lHl) (3.34) 

Cg *n* -q- (K(n-l)+l) 2 {(<(n-l)+l) 2 + <(n-l)-5-l} (3.35) 

Again for the case where such that x<<n, then 

Cg -v j (tc 1 n 1 ) 3 x 2 n 2 (3.36) 

(3.37) 

For the local substructural setup, the bandwidth is given by 


Cg ^ ^ (<n) 4 


B t = n l +1 


(3.38) 
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»t - " +1 


(3.39) 


The total number of degrees of freedom are 


N* * ‘ n)2 

Note for the 1 th substructure, the nearly optimal bandwidth is 
~ n 


(3.40) 

(3.41) 


(3.42) 


If a Gaussian scheme [10] is employed to solve the internally/externally 
partitioned version, then the connectivity block, i.e., [< I£ ] tends to 
yield an increased bandwidth. In particular, from an asymptotic point 


of view 


3n 


(3.43) 


Hence, the asymptotic computational effort associated with the local 


block is given by 

_ 9 4 

C 1 ' 2 n 

The optimal version yields 
^tootimal 2 n 


(3.44) 


(3.45) 


Thus asymptotically 

<V C i 0 ptimal ' 9 


(3.46) 


This state of affairs can be improved with the use of nested dissection [14] 


to condense the local partition. 

Considering the second level of the partitioned scheme, the estima- 
tion of calculation load is complicated by the sparse nature of the 
region under consideration. Noting Fig. 3.5, the total number of external 
degrees of freedom is given by the expressions 

N = (< 1 (n 1 -l)+l)(< 2 +l) + (x 2 (n 2 -l)+l)(< 1 +l) - (< 1 +1)(< 2 +1) , 

(3.47) 

N - 2(<(n-l)+l)(K+l) - (*+l) 2 (3.48) 
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In the case that (n^, n., , k ^) >> 1; > <^» then 

N ^ ic^-Cn.+n.) (3.49) 

e 1 2 1 4 

N ^ 2(ic)“n (3.50) 

e 

To establish the bandwidth of the assembled external level formula- 
tion, several factors must be taken into account. Specifically, noting 
Fig. 3.6, the second level formulation consists of families of horizontal 
and vertical mesh points. In this context, bandwidth is contingent on: 

i) Vertical positioning of horizontals; 

ii) Location within either verticals or horizontals 

Based on this, the bandwidths associated with the various external degrees of 
freedom are given by the following external family of expressions, namely: 


i) First vertical, 

B 1 .. = i 
evij 

je[lt< 1 l» ie[l»n^-l) 
ii) 2nd, 3rd k., +1 verticals, 

B^ .. = ie.(n,-l) + (ic 1 +l)(n_-2) + i 
evij 11 12 

ie[2,n.], ke[2,tc 2 +l], jetl,*^] 

B^ . = ie.(n..-l) + (<.+l)(n_-2) + 1 
evol 11 12 

ke[2, ic 7 +l] 

iii) First horizontal, 

B ih:j = lf 1 (n 1 ‘ 1 > +1 + Uj+DU+l) 

ie[l,n 7 -2], je[l,< 2 I 


(3.51) 


(3.52) 


(3.53) 


(3.54) 
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(3.55) 


iv) 2nd, 3rd,..., <^+l horizontals. 


B ehij = ( i" 1 ) " (j-2(n 1 -l)) 

ie[l,n 2 “2], ketl.^l 

j c [ 2 , ic^+1 ] 

such that the subscripts e, h and v define external, horizontal 
Averaging the above noted bandwidths we yield the expressions 


i) First vertical; ic^(n^-l) entries. 


1 n l 

<B 

ev: j 2 


ii) 2nd, 3rd,... verticals, 

v n i 

<B* ..> = x.(n.-l) + (<.+l)(n_-2) +1+ —z 

evxj 11 12 2 


ic^ic^Cn^-l) entries 


<B evol > = K l (n r 1} + (k 1 +1)( V 2) + 1 
*2 entries 


iii) First horizontals; (n 2 “2)K2 entries; 

n -i 

<B , . . > = tc.(n.-l) +1+ (<.+l) — = — - (k.+ 1) 
ehij 11 12 1 


iv) 2nd, 3rd,... horizontals, (n2 - 2)iCjiC2 entries 


n 9 -l 


. <i+l 

<B ehij > * <^(n^-l) +1+ + (<j+l) “j “ (<j+l) 


X -+1 

+ 2( ni -l) - (nj-lXl + -J-) 


and vertical. 


(3.56) 


(3.57) 


(3.58) 


(3.59) 


(3.60) 
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For the square region, we yield that 


<B ..> 

evij 


- n/2 


<B k ..> 
evij 


= <(n-l) + (ic+l)(n-2) +1+ n/2 


<B k l 5, 
evol , 


= <(n-l) + (>c+l)(n-2) + 1 


(3.61) 


<B^ V = <(n-l) +l-r (<+l) - (<+l) 

ehij 2 


<B ehij > 


= K 


(n-1) +1+ + (tc+1) - (<+l) + 


2(n-l) - (n-1) ( 1 + 


In the case that (n^ n 2> < 1> < 2 ) » 1; (n^ n 2 ) » (<j, < 2 ) then 


and 


<B . .> ^ n* /0 
evij 1/2 


<B* . .> *v K.n. + 

evij 11 12 


<B . > ^ <,n. + K.n- 
evol 11 12 


<B ehij > ' Vl + ler l n 2/2 


<B .hlJ > ' 'l”l + *l n 2/2 ' n l*l/2 


<B X . ^ n/2 

evij 


<B . . > ^ 2<n 
evij 


(3.62) 


<B* n > ^ 2>cn 
evol 


(3.63) 


<B 1 . . . > *v i- icn 
ehij 2 


<B . . . > ^ <n 
ehij 
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Based on the foregoing averages of the individual vertical and 
horizontal external level bandwidths, the overall average takes the form 


<B e > - A- (TjOy-l) <B* vi .> + V 2 (n r l) <B^ vij > + < 2 (B^ o1 ) + 

e 

( n 2*2) < 2 + 'v 2 ^! ‘“ehij’ 1 <3 ' 64) 


For the square region, (3.64) reduces to the form 


<B 


> -v i- {xn <B* > + f 2 (n-l) ,> + * <B 


N 


evol 


> + 


<(n- 2 ) < b\ . -> + < 2 (n- 2 ) <bJ\. ,>} 


(3.65) 


evij ' evij 

. . . .> + < 2 (n-2) <L ... 
ehij ehij 

As before, for the asymptotic case wherein (Kj, < 2 » n^ n^) >> 1; (n^, n 2 ) > 


( < 1 y we y ield t ^ ie expression 

<B > ^ (<n — + K^n(2icn) + 

e N Z 


3 2 

ic( 2 <n) + <n (t- icn) + ic“n(icn)} 


Since 


N *v 2<“n , 
e 


it follows that 


<B > *w -r ten 
e 2 


(3.66) 


(3.67) 


(3.68) ' 


Employing the foregoing relations, we are now in the position to 
establish the calculation effort at the 2nd global level. In particular 

C = 7 N ((<B >) 2 + <B >) (3.69) 

e 2 e e e 

To simplify the forthcoming discussion of optimization we will confine 
our development .to the asymptotic forms. These can easily be upgraded 
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for all possible choices of n^, n^, * 7 » i- e - after some extensive 

algebraic manipulations. In this context, C g takes the form 

C 'v 7 (>c) 4 (n ) 3 (3.70) 


To chose the optimal values of k and n, we must establish the net 
effort of the hierarchically parallel scheme. In this context, noting 
the local and external efforts defined by (3.46) and (3.70), we yield 
that C the total hierarchical effort is defined by the relation. 


C 


t 



+ C 

e 


9 

2 


(n ) 4 + j (<) 4 (n ) 3 


(3.71) 


At this juncture it is worthwhile 
external and total hierarchical ef 
tured approach. Recalling (3.37), 

R , = C./C -v 9 /(<) 4 

t/g «■ g 


R , = C /C "v 4.5/n 

e/g eg 


determining the ratios between local , 
forts to the straight full nonsubstruc- 
this yield the following expressions 

(3.72) 

(3.73) 


and 


l t/g 


- V c g 


“v 4.5/n + 9 / (<) 


(3.74) 


where R t / g , R e/g and R 
second and total dual 
simulation. 

Employing (3.72) 
of n and < for a given 
namely: 


define the respective ratios between the first, 
g 

.evel efforts and that of the full nonsubstructured 


- ( 3 . 74 ), we need to establish the optimal choices 
B . Two approaches can be taken to achieve this. 
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i) Equalize problem sizes at both local and external levels; or, 
ii) Minimize R^ with respect to tc. 

For case (i) namely computer load equalization, we require that 

r - c (3.75) 

This leads to the expression 



(3.76) 


For case (ii), we require that 


h (K t/s> 5 0 

Recasting (3.74) in terms of < and B , we yield the relation 

, - K ,9 

R t/ g ' 4 - 5 r + ~ 

s g < 

In terms of (3.77), (3.78) yields the expression 



(3.77) 


(3.78) 


(3.79) 


Note while (3.76) balances the computing load among processors, 
(3.79) optimizes, i.e. minimizes the overall computational effort. For 
instance, noting Table 3.1, as would be expected, as problem size in- 
creases, the number of processors needed to balance local and external 
computing load increases. Concommitant with the increase in processors 
and problem size, we also see that the relative advantages of hierarch- 
ical parallelism also grow significantly. This is a direct outgrowth 
of the localized bandwidth minimization afforded by the procedure. 
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For the case of overall load minimization, i.e. (3.17), Table 3.2 
illustrates the improvements afforded by the current form of parallelism. 
As would be expected, (3.17) yields somewhat improved results over the 
load sharing scheme. This, of course, is at a cost of requiring a modest 
increase in the number of processors. A comparison of the two schemes 
illustrates that the increased number of processors is 
somewhat offset by the decrease in relative loading. In particular, 
comparing the ratios noted in Tables 3.1 and 3.2, it follows that 

the local processor load is reduced by 300 percent for the optimized 
scheme. In this context somewhat less powerful individual processors 
could be employed for the work load optimized approach. 

To yield more optimal results, a multilevel hierarchicalism may be 
employed. Considering a three level system. Figs. 3.7 and 3.8 . two 
levels of partitioning are required. After extensive manipulations, it 
follows that the asymptotic ratio between the straight classic and the 


hierarchical methodology takes the form 
R t/g “ C i/g + C e/g + C e/g 


wherein 


(3.80) 


C„ / *'• 
4/ S 


C e/g ' ( “ ) 


'e/g ' 2 ' B 


3>‘ 

(3.81) 

*2 

(3.82) 

B (< 7 ) 3 


g 3 


^3 

(3.83) 


such that (*.,)“. (* 3 )^ denote the number of substructure at the 2 nd and 

3rd levels. Furthermore C“, and C 3 , represent the calculation load 

e/g e/g 

defined by the 2 nd and 3rd levels. 
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To establish an optimal choice of substructuring, the appropriate 
values of and < 3 must be established. This is achieved by taking 
the requisite partial derivatives, namely 




) = 


0 


(3.84) 


fc; (, W - 0 


(3.85) 


Employing (3.80) - (3.85) we yield the expressions 
n 36 , 49 1 

0 % ~ . p i ^ 

(< 3 ) 4 (< 2 ) 5 A Bg (< 3 ) 3 

36 147 K _2 9 

' ' (.c/u/ " ‘ Bs ( '3> 2B * 


(3.86) 


(3.87) 


Solving (3.86) for < 7 yields the relation 


« 2 ) 5 - ^ * 

2 49 ic 3 


(3.88) 


Since < 2 is non-negative and real, (3.86) can be reduced to the form 


<2 ' 


144 Bg l 5 


49 k . 


(3.89) 


In terms of (3.89), we yield the following expression for <y namely 


1 1 
9 36_ , _49 ^3 n 5 147 ( 144 Bg ^5 

2Bg ^ ■ ) 5 < ' 144 Bg } 4Bg< 3 V 49 k 3 


(3.90) 
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As can be seen from the preceeding development, the three level archi- 


tecture yields a several fold improvement, namely: 

i) Significantly increased overall speed; 

ii) Loading per individual local processors can be reduced several 
orders of magnitude; 

iii) Due to reduced loading of local level, the associated proces- 
sors can be reduced in number and reused in a series format. 

In the context of iii), fewer processors are required by the local level. 
For the optimal case, since c ^ « C ” /g , rather than having 
separate processors for each substructure, some may be reused. For such 
situations, (3.80) takes the following modified form, namely 


2 3 

R , = f,K n C. / + f-K-, C / + C / 

t/g 1 2 ifg 23 e/g e/g 


(3.91) 


where here f J and f 9 denote factors defining the inverse ratio of the 
number of processors. In particular 


f x e 11, 1/< 2 1 

f 2 e (1, l/< 3 ] 


(3.92) 


In terms of the foregoing, the hierarchical strategy can employ a degree, 
of serialism without any real sacrifice in overall speed. 

Note the 3-D cube analogy shows even more significant improvements 
when the hierarchical substructuring strategy is employed. In this con- 
text, it follows that employing the concept of local bandwidth minimiza- 
tion in conjunction with the appropriate mix of local and external vari- 
ables, significant improvements can be achieved. These can be further 
enhanced as the number of levels is increased. 
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4. 


Algorithmic Considerations 


As noted in the previous section, the solution to say the two level 
architecture can be performed in several ways, namely: 

i) Direct at both levels; 

ii) Direct locally and iterative globally or vice versa; and, 

iii) Mixed direct and iterative at both levels. 


Since the purely direct method has essentially been outlined earlier, 
this section will consider cases ii) and iii). In this context, we shall 
develop mixed direct/iterative and purely iterative solutions to the two 

level formulation defined by (3.19). 

Note the local level associated with (3.19) involves taking the in- 
verse of the block diagonal matrix [K^J . In terms of (3.20) and (3.14), 
we see that the inverse of [K^] can be achieved as a number of independent 
partitions . Depending on the associated matrix conditioning, i.e. spectral 
radius, either direct or iterative procedures could be employed. In this 
context 


iV' 1 - 


’w 1 

[0] 

[0] 

(0] 

^db k db^ 

[0] 

CO] 

[0] 

r v 
l MB MB 


where for instance 

w 1 - 

'[^f 1 [0] 

[0] [ 3 Xj] 


(4.1) 


(4.2) 
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Here the various partitions making up » f DB K DB ] and t HF K MB 1 COuld be 

handled by entirely separate schemes, i.e.: 


i) Direct; or 
ii) Iterative, for example: 

• Jacobi 

• Gauss-Seidel 

• SOR 

• Steepest descent 

• Conjugate gradient 

At the global level, (3.19) can be solved via either a direct or 
iterative methodology. For the simplest formulation the Jacobi type 
method, (3.19) yields the algorithm 

IWi* 9 + "V + 'V»n C4 ' 2 


or after local inversion 

T n+1 - [ Kj ))” 1 « + («„] + [K L l)T n ) 

For the Gauss-Seidel methodology, (3.19) yields the expression 

T n+l Miy - i* t n' 1 (9 + iv i„> 

As will be seen in Part II of this series, since 


[K] = fK D ] - (fK L ) + fKy]) 

represents a regular splitting of [K], the Stein-Rosenberg theorem 

can be employed to show that the Gauss-Seidel version is superior to the 

Jacobi. 
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In the case of SOR type methodologies, (3.19) yields the following 
algorithm, namely 

?n + l ‘ “ 1K L> ?n + l ’ 

(l-uHKj,) T n + ulKy] T n + uQ 

such that the optimal choice of the over/under relaxation parameter w can 
yield super Gauss- Seidel convergence rates. Again, as will be seen in 
Part II, since f K] is cyclic, i.e. 2-cyclic, it follows that ioe(l,2). 

In the case that u = 1, the Gauss -Seidel method is retrieved. 

For the five point finite difference operator, prototypical simila- 
tions yield [K] which are Stieltjean type M matrices. As such, the con- 
vergence of the Jacobi and Gauss-Seidel schemes are guaranteed. This is 
also true of the SOR method for optimal choices of w. Note, if the SOR 
method is employed at the local level, each distinct block of [Kq] could 
have its own ui. Much of this will be formalized in Part II. 

Note each of the various partitions of [K^] are symmetric and positive 
definite. Hence, it follows that locally the conjugate gradient will 
yield convergent inverses. 

To yield a stable and efficient convergence process, the initial 
guess/starting values need to be relatively accurate. For example, 
the initial seeding of the iteration process can be achieved via a multi- 
grid type procedure wherein interpolation/ extrapolation is used to 

obtain the necessary information. For the case of local partitions in- 
volving direct solvers, the information flow would occur at partition 
boundaries. In the case of iterative solvers, information should be 
generated throughout the interior and on the boundaries of the pertinent 
partitions . 


34 



As will be seen later, in addition to optimizing bandwidth minimiza- 
tion, i.e. processor loading, the hierarchical methodology improves the 
convergence characteristics of iterative schemes. For instance, consider 
the conjugate gradient procedure. Like the steepest descent method 
the derivation of the conjugate method is rooted in the quadratic form, 
namely 

f " 2 + C (4 ' ?) 

where here we consider the inverse of the [jKj] diagonal block of [K^]. 
Overall the algorithms various steps are defined by the following sequence 
of operations namely 


■ «i>k + 

?k+l = Ik + T k f I K I J -k 

,i l , a A 

-k+1 “ ik+1 + P k-k 

such that here 


( 4 . 8 ) 


( 4 . 8 ) 


( 4 . 9 ) 


Note that: 

i) t 5 defines the optimal step size along the search direction 

A 

defined by d^ ; and, 

ii) b£ defines the parameter which yields the optimal search direction. 
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When the conjugate gradient method is applied globally, there exists 
but one set of and B k , i.e. the global set. In this context they re- 
flect the needs of the globally assembled set. Prototypically, the needs 

of a given partition may differ hence leading to convergence problems. 

2, i 

For the current hierarchical methodology, the P a * r i- s optimized 

on a local partition basis. In this context they reflect the intrinsic 
substructural requirements. As will be seen, this greatly improves the 
convergence characteristics of the overall problem. This is especially 
true for problems wherein the conjugate gradient methodology is used 
locally and say the SOR is employed globally. 

To improve the overall convergence process, prototypically after the 
initial seeding, the local iteration process is allowed to converge to a 
predetermined accuracy. Subsequently, the global iteration is commenced. 
This enables the refinement of the seed values. As noted earlier, the 
Part II of this series will develop the formal properties of the foregoing 
solution procedures. This will also include benchmarking. 
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Table 3.1 - Computational Effectiveness for Load Balance Equalized 
System: Two Level Hierarchy. 


B 

g 

K 

2 

K 

n 

R e/g 

R t/g 

'W ' 1 

500 

3.98 

15.8 

125.6 

.0358 

.0717 

13.9 

5000 

6.31 

39.8 

792.4 

.00568 

.0113 

88.5 

10000 

7.24 

52.5 

1 

1379.0 

.00326 

.00652 

153.4 

50000 

10.0 

100 

5000 

.0009 

.0018 

555.5 

L00000 

11.48 

131 

8710 

.000516 

.00103 

967.7 
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Table 3.2 - Computational Effectiveness for Effort Optimized System: 
Two Level Hierarchy. 


B 

8 

1C 

2 

K 

n 

R / 
e/g 

R t/g 

'W 1 

500 

5.25 

27.5 

95.2 

.0118 

.059 

16.9 

5000 

8.32 

69.3 

600.9 

.00187 

.00935 

106.8 

10000 

9.56 

91.5 

1046 

.00107 

.0053 

185.9 

50000 

13.2 

174 

3787 

.000296 

.00147 

677.3 

100000 

15.2 

229 

6579 

.000168 

.00085 

1178.4 
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Figure 3.1 - Partitioned Physical System 



LEVEL 



Figure 3.2 - Multilevel Hierarchical Scheme 
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Figure 3.3 - Two Level Architecture 







Figure 3.5 - Second Level Extemal/Global Model 
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Figure 3.6 - Families of Verticals and Horizontals of Second External 
Global Level. 
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Figure 3.7 - Three Levels Hierarchy 
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PART n - FORMAL CONSIDERATIONS AND BENCHMARKING 

SUMMARY 

In Part I of this series, a hierarchically parallel finite difference modeling methodology and 
associated solution scheme was developed. The main purpose was to establish a logical and 
efficient procedure for use in parallel computer environments. In this part, consideration is given 
to establish the formal numerical properties of the scheme. Specifically, several theorems and 
associated proofs are discussed which consider the convergence characteristics of the various 
solutions algorithms developed in association with the parallel methodology. These are backed 
up with large scale benchmark experiments performed on a VAX 785, a vectorized IBM 3090- 
200 and a CRAY-XMP. 
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1. INTRODUCTION 

As noted in Part I, the main thrust of this series has been 
the development of an hierarchically parallel modelling and 
solution strategy for finite difference analysis of heat 
conduction. The basic emphasis of Part I is to formulate the 
overall strategy and solution methodology. In this part, formal 
consideration and benchmarking will be undertaken. 

Included for consideration will be an evaluation of: 

i) The formal structure of the governing heat conduction 
equations and associated solution algorithms; 

ii) The convergence properties; 

iii) Spectral properties; and 

iv) A comparison of convergence rates. 

The benchmarking of the parallel scheme will consider the 
various versions of the procedure, i.e., the direct, mixed, and 
purely iterative formulations. 

Note, the formal evaluation of properties will employ the 
various theoretical aspects of P-cyclic, Stieltjean, nonnegative 
and M-matrices [1,2,3]. Overall this will enable a very detailed 
and comprehensive evaluation of the requirements for convergence, 
the spectral properties of the various versions of the parallel 
algorithms as well as the relative behavior among the various 
f ormulations . 

Overall this part is organized into three major sections 
which: 
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i) overview the algorithmic structure and properties of 
the various coefficient matrices; 

ii) give a detailed consideration of the formal numerical 

properties and 

iii) thoroughly benchmark the various versions of the scheme 
on IBM 3090. Vax 895 and CRAY XMP systems. 

2. ALGORITHMIC OVERVIEW 

Recalling Part I of the paper, the global assembled version 
of the governing FD equations of heat conduction take the form 



where [Kp] . [Ky] . and [KJ respectively contain the block 
diagonal, upper and lower triangular elements of [K] the global 
conductivity matrix. Overall the forms are given by the 
expressions 


CK d 3 = 


w 

[0] 

[0] 

[0] 

^db k db^ 

i — » 

O 
i— i 

1 — 1 
O 

L-J 

[0] 

£mb k mb^ 


( 2 . 2 ) 
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CK„] = CK l ] - 

[°] ^i k db^ 

[ 0 ] [ 0 ] 

[ 0 ] [ 0 ] 


Equation (2.1) can be solved in either of three ways, 
namely : 


i) 

direct 

inversion : 

ii) 

purely 

iteratively or; 


iii) by mixed direct and iterative procedures. 

As seen earlier, this can be achieved at a variety of 
hierarchical levels. For such a formulation, the various 
partitions of [K^] can be interpreted as follows, namely: 
i) [j K l^ is a dia S onal block matrix whose diagonal 

elements are nonsingular submatrices corresponding to 
the internal variables of each substructure, 
xi) * S a dia S ona * block matrix whose diagonal 

elements are nonsingular submatrices corresponding to 
dual shared boundaries between submatrices: and 

iii) [ K U1J 1 is a nonsingular diagonal matrix corresponding 
to multiply shared boundaries. 

In a similar context. [jK^]. [jK^] [ DB K MB ] and their transposes 

define the assembled connectivity matrices linking the various 
substructure . 

For the direct procedure. (2.1) can be solved either in its 
global form or at the various subs true tural levels using static 


Cj K hB^ 

£db k mb^ 

[ 0 ] 


(2.3) 
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condensation and forward elimination and backward substitution 
steps as discussed in Part I. 

In the case of matrix iteration, as noted earlier, either a 
nixed or fully iterative solution methodology can be employed. 
Overall the solution process involves several levels. These 
primarily consist of local and global phases of calculation. At 
the global level, an iterative scheme is employed, i.e.. either 
Jacobi. Gauss-Seidel or SOR methods [1.2]. For the local level, 
inverses of the various subs tructural matrices can either be 
obtained via direct means or by say the robust preconditioned 

conjugate gradient methodology [4,5]. 

As an example, consider the Jacobi type of a global 

formulation. Here (2.1) takes the form 

tV In+1 * (tV + CV> In * 2 (2 ' 4) 

or more directly 

T n+1 = [K^- 1 «V * tV>In * 2> (2 \ 5) 

Overall, (2.5) consists of two levels of calculation, namely: 

i) The local inversion of [K^] . and 

ii) The evaluation of T n+1 after the appropriate matrix 


multiplication steps. 
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Note the inverse of [K^] can be established partition by 
partition. This is possible since there is no coupling between 
the various blocks making up [jKj]. Cdb K Db 1 and ^MB K MB^ ' Hence - 

[Kp]' 1 = fCi K i]" 1 C°3 C°3 ] (2.6) 

C°] Cdb^bT 1 [0] 

[o] CO] C mb k mb ] _1 

As noted earlier, the inversion of the various blocks can be 
achieved either directly, or via mixed direct and iterative 
schemes. Here the conjugate gradient method with preconditioning 
could be applied to great advantage especially for well 
condi t ioned parti t ions . 

For the Gauss-Seidel styled methodology, (2.1) takes the 

form 

I„+l = CCK d ] - [KJ)- 1 ([ Ku ]T n ♦ Q) (2.7) 

Here the inversion process is somewhat more awkward. 

To determine the various formal numerical characteristics of 
such algorithms, the properties of the various coefficient 
matrices must be defined. To simplify the development, without 
loss of generality, we shall consider formulations involving the 
use of 5-point 2-D and 7-point 3-D finite difference 
representations £1,23- In this context it follows that [K3 
positive definite Hermitian and [K^] . [Kjj] are nonnegative 


53 



strictly lower and upper triangular matrices. Such properties 
also apply at the subs tructural level. The next section will 
employ these properties to formally illustrate the convergence 
characteristics of the hierarchical methodology. 

3. FORMAL CONSIDERATIONS 

Overall the formal considerations will have four main 
thrusts , i . e . ‘ 

i) convergence properties; 

ii) spectral properties; 

iii) comparison of convergence rates among full iterative 
and mixed approach, and; 

iv) local global attributes. 

This will be established in a series of theorems and their 
associated proofs. 

First we will show that the global Jacobi and Gauss-Seidel 
(GS) iterative method defined by (2.5) and (2.7) respectively 
converges for any arbitrarily initial vector T q . i.e.. if T is 

the true solution of (2.1) and (T n > are the sequence derived from 

(2.5) or (2.6). then lira T r = T. 

n ■* • ~ ~ 

From (2.5) and (2.7), the Jacobi and GS iteration matrices 
respectively are given by 

J = [Kp]- 1 <[K l ] + [*„]) (3- 1 ) 

a = <[k d ] - CHj]- 
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It is well known that an iterative method converges 

theoretically to the solution for any arbitrarily starting vector 

T if and only if the spectral radius, p(G), of the iteration 
~o 

matrix G is less 'than unity. Here 


p (G) = Max |X |. X. is an eigenvalue of G. 

HHn 

Before giving results on the convergence of the block iterative 
method, we give the following definitions [1]. 


Def ini tlon . A real n x n matrix A = (a. ^.) with , . < 0 for 

all i ?! j is an M-matrix if A is nonsingular, and A 1 > 0. 

Definition . A real n x n matrix A = (a. ^.) with a. . ^ i 0 for 

all i ?! j is a Stielties Matrix if A is symmetric and positive 

def ini te . 

Definition - For n x n real matrices A, M. N. A = M-N is a 
regular splitting of the matrix A if M is nonsingular with M 2 
0 , and N > 0 . 

Suppose a matrix A has the following special partitioned form 
(block tri-diagonal): 


A = 


A l.l A 1 . 2 Q 

A 2 . 1 A 2 . 2 A 2 , 3 

° V , 


(3.2) 


where the diagonal submatrices A t i . Hi^q are square and 
nonsingular. This give rise to a block Jacobi matrix of the form 
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0 B 


1.2 


o 


J(A) = 


B 2 . 1 0 


B 


2.3 


(3.3) 


o 


B 


q-l-q 


B 


q.q-i 


corresponding to the partitioning of A. Note J(A) is defined as 
weakly cyclic of index 2 and A as 2-cyclic. 

Let J ( A) = L + U, where L and U are strictly the lower and 

upper triangular parts of J(A) in block form. 

Def ini tion . If the matrix A of (3.2) is 2-cyclic. then the 
matrix A is consistently ordered if all the eigenvalues of the 

matrix 


J (A) = aL + a , (3.4) 

are independent of a, for a A 0. 

Now we give the following convergence theorem for the Jacobi 

and Gauss-Seidel iterative methods. 

Theorem A. The block Jacobi and block Gauss-Seidel iterative 
methods defined by (2.5) and (2.7) are convergent for the block 
system of linear equations (2.1) for any initial vector 
approximations T q . Moreover, the Gauss-Seidel Method converges 

faster than Jacobi method. 
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Proof . For the matrix [K] of (2.1) derived from the five point 
finite difference formula applied to self-adjoint elliptic (Heat 
conduction) equation, it is known from Theorem 6.4 of Yarga [1] 
that [K] is positive definite Hermitian and a Steiltjes matrix. 
We remark that every Steiltjes matrix is an M-matrix. 

Def ine 


M i = N 1 “ + : 

m 2 = {[k d H " n 2 = [K u ] - (3 ’ 5) 

Since [K] = [K^] - [K L ] - [Ky] is an M-matrix. It follows from 
Theorem 3.14 of Yarga [1] that Mj and Mg are also M-matrices. 
Thus, it follows that [K] = Mj - Nj = Mg - Ng are two regular 
splittings of [K] - Moreover Nj > Ng ^ 0. Hence by theorem 3.15 
of Yarga [1], 

0 < p (Mg 1 Ng) < p (M" 1 N x ) < 1 (3.6) 

From (3.1). it follows that 

J = Mj 1 H r (3.7) 

S = M^ 1 n 2 . 

Thus. 

0 < p (tf) < p (J) < 1. (3.8) 
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this implies that both the Jacobi and Gauss-Seidel iterative 
methods converge. Moreover Gauss-Seidel Method converges faster 

than Jacobi Method since p(2) < p(J) 

In view of the above theorem, the Gauss-Seidel scheme offers 

improved convergence rates over the Jacobi version. If the 
over/under relaxation parameter is properly defined, the SOR 
prototypical ly yields improved results over the Jacobi and 
Gauss-Seidel methods. For the current hierarchical formulation, 
the SOR takes the following form namely 

([K d ] - O [K l ] | T n+1 = (l-“) I„ + "£ K U ] In + " S (3 ’ 9) 

where « is the relaxation parameter. 

If we use the five point finite difference scheme to 
simulate the heat conduction equation, then [j K ng] = 0 in 
equation (2.3). Hence, from the earlier definition, it follows 
that [K] = [K d ] - [K l ] - [Ky] is a 2-cyclic consistently ordered 
matrix. The SOR iteration matrix corresponding to (2.1) is then 
defined by 

= ([K d ] - c, [KJ)' 1 ((!-") [K d ] + " CK U ]}. (3.10) 

The convergence and optimal u for the SOR method defined by (3.9) 
is given in the following theorem which involves the properties 
of 2-cyclic and consistently ordered matrix theory [1.2.3]. 
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Theorem B . Let the matrix [K] = [Kp] - [K^] - [K y ] of (2.1) be a 
consistently ordered 2-cyclic matrix with nonsingular diagonal 
submatrices [K^,] . If all the eigenvalues of the second power of 
the associated block Jacobi matrix J are real and non-negative, 
and 0 £ p(J) < 1. then with 

u b = 2 (3.11) 

1 + V 1 - p 2 (J) 

it follows that 


p(*„) - ("b' 1 ’ 


and 


PC*,) > P(* > 

D 

for all u j* u, : Moreover, the block successive overrelaxation 

b 

matrix 2 is convergent, i.e., p(i£ ) < 1 . for all u with 0 < cj < 

(i) W 

2 . 

Proof . Note from Theorem A, we have 0 < p(J) ^ I* Since the 
block Jacobi matrix J is symmetric, all its eigenvalues are real. 

ft 

Hence eigenvalues of J will be non-negative. Thus, all the 
conditions of Theorem B are satisified. 

Since the Gauss-Seidel procedure converges faster than 
Jacobi’s, we shall give an outline of the parallel version of the 
block Gauss-Seidel iterative method for the five-point finite 
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difference formula. It is expensive to find the optimal u for 
the SOR iterative method. However, we also give the analogue 
version for the SOR method. 

There are three levels of computations. Level 1 consists of 
solving for all unknowns inside the substructures. Level 2 
consists of all points on the double boundaries and level 3 
consists of all points on the multiple boundaries. 

Equations (2.1) can be written in the block form as follows: 


’ [I K I] 

-[I K DB] 

[0] 


r 

M 

H \ 

V 


( 

OX 

\ 

-[I K DB^ 

[DB K DB] 

-[DB K MB] 

< 

T 11 

) - < 

Q 11 ' 

[0] 

- ^db k mb^ 

1 — 1 
w 

w 


T 111 

L j 


ql I I 

c J 


Here superscript denotes the levels. 
Block Gauss-Seidel Version (Algorithm 1) 
For n = 0. 1 . 2 ... 

Solve : 

!■ W lUn = !(»> + ~ ' 


(3.13) 


2. C DB K DB 3 t“ +1) = Ifyf lj n+1) - C m K„3T$ * 2 11 . (3-H) 

(3.15) 


3- T?, 11 


.II 


MB a HB j i(n+l) C DB K HB 3 I(n+1) + 2 


III 
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Rlock SQR Version (Algorithm 2) 


!• W I(n + 1) 

2 ' CdB K DB^ I(n+1 ) 


3 - ^mb k mb 


1 T 111 
J I(n+1) 


(1-<j) CjKj] T^ n) + "CjKjjb] T (n) + u 2 

(3.16) 

= u Ci K DB^ I(n+1) + (1_w) ^DbSb^ 1 I(n) (3,17) 

+ ^db k mb^ ill ) + « S 11 ' (3 ' 18) 

= U ^*DB^MB^ I(n+1) + (1_u) [ MB K HB ] I(n) 

+ u Q 111 (3.19) 


The number of processors required to solve (3.13) are equal to 
the number of substructures. Number of processors required to 
solve (3.14) are equal to the number of double boundary lines. 
Finally, the number of processors required to solve (3.15) are 
equal to the number of multiple boundaries. The same number of 
processors are also required for the SOR method. 

Note, theorems A and B are based on the assumption that the 
inverse of [jKj], C DB K DB ] and [ MB K MB ] are obtained via direct 
methods, i.e. Gauss elimination. Cholevsky decomposition etc. 
However, as will be seen later, such local inverses can be 
obtained by a convergent iterative scheme, i.e. the conjugate 
gradient scheme. This gives the overall scheme two phases of 
iteration, the local and global. 
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The previous formalism pertained primarily to the global 
level of the iteration process. As such, it is uneffected by 
converged local calculations of [Kp] Specifically the global 

level formalism remains intact if: 

i) The complete inversion of [Kp] is performed directly; 

ii) The complete inverse of [Kp] is obtained via a 
convergent local iteration process and; 

iii) If the inverse of [K^] is obtained via a mixture of 
direct and convergent iterative scheme. 

Note since all the various partitions and associated blocks 
of [K d ] are Stieltjean. convergence is guaranteed for such 
iterative methodologies as: 

i) conjugate gradient with and without preconditioning, 
or ; 

i i ) the more classical Jacobi and Gauss Seidel and SOR 
schemes . 

With the use of five-point 2-D and seven-point 3-D 
difference representations, the preconditioning follows directly 
from the structure of the partitions of [K^J. For instance, 
considering the five-point 2-D difference formulation, the 
various substructural blocks making up the CjK^] partition of 
[K D ] are five diagonal. For this case, the preconditioning used 
in conjunction with the conjugate gradient method would be 
structured accordingly, i.e. to preserve the five diagonal 
f ormat [5] . 
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As will be seen during the benchmarking procedure, for the 
mixed and completely iterative hierarchical methodologies, the 
overall iteration process can be performed in several different 
ways , i . e . : 

i) for each cycle of the global iteration, all the 
distinct blocks are iterated until locally converged; 

ii) for each cycle of the global iteration, various of the 
local iteration processes are consequential that is, 
for certain designated blocks, each cycle of local 
iteration is followed by a global one, and; 

iii) each cycle of all iterated blocks are followed by a 
global iteration. 

Note the proceeding theoretical development guarantees the 
convergence of case (i) schemes. As will be seen from the 
benchmarking, during the initial phases of the solution process, 
case (i) iterations are employed to obtain a better approximation 
to the solution. Once the desired accuracy is achieved, case 
(ii) and (iii) procedures can then be used to complete the 
solution. Such an approach is particularly advantageous to use 
when conjugate gradient procedure are employed locally. This . 
follows from the fact that the rate of convergence of the 
conjugate gradient method improves in close neighborhoods of the 
solution. Recalling Part I, the basic conjugate gradient 
algorithm [6] takes the following form locally, namely: 
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e 

T k 


( 3 . 20 ) 


- ll*>k + T k5k 

( 3 . 21 ) 

e 

Sk+l 

= £k + T k 5k 

( 3 . 22 ) 

p k 


( 3 . 23 ) 

2k+l 

£k+l + ^k+1 £k £k 

( 3 . 24 ) 


such that here the gradient g^ is defined by the expression 

£k - ^ tr*) k (3-25) 

Specifically g^ represents the gradient of the quadratic form 

f = (lf) / C I K f3 (Tf) - (Tj/ Oj (3.26) 


£ 

Typically, for the method of steepest descent, g^ is used to 
define the search direction. For such an algorithm the 

p 

subsequent g^ are mutually orthogonal, i.e. 


p / p 

( £k> Sk-i 


0 
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For the conjugate gradient version of the scheme, the search 
direction is modified in the manner defined by (3.24). Since 
is chosen to satisfy optimality conditions, the subsequent search 
directions are no longer perpendicular but rather are free to 
range within the optimal bounds. 

p 

Overall r* defines the optimal step size along the optimized 

p 

search direction defined by /? k . Note, these parameters pertain 
to the i th subdomain. Since the [jK^] may all be distinct, the 
associated search direction and step sizes may all vary. This 
also applies to various diagonal blocks making up an< * 

[ K UIJ ]. The overall family of r and P represents one of the 

distinct advantages of the hierarchical scheme. Namely, rather 
than one global (r, P) pair, the hierarchical scheme provides for 

0 p 

locally defined (r . P ) pairs, i.e. for the sets of 
subs true tural interior degress of freedom, as well as for dual 
and multiply connected boundary variables. Since such pairs more 
properly reflect the local optimality conditions required for 
local convergence, the stability and efficiency of the CG is 
greatly enhanced. Such advantages can be further enhanced 
through the use of a preconditioned version of the CG , i.e. , the 
PCG. This will be discussed in the next section. 

As noted earlier, similar comments apply to the SOR scheme. 
Namely, distinct could be generated for each of the various 
substructure and dual and multiply connected diagonal blocks. 
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the 


In addition to enabling localized optimization, 
hierarchical methodology tends to provide a means to decrease the 
effects of roundoff generated in all computer hardware. This 
applies both to the direct as well as iterative schemes. For the 
direct method, roundoff is decreased due to the significant 
reductions in bandwidth and problem size associated with 
substructuring Another improvement in the roundoff problem 
follows from the fact that the partitioning process associated 
with the hierarchical strategy tends to zonalize such effects. 
This is especially true for the iterative schemes. For instance, 
recalling the various inner products associated with the CG . i.e. 
those defined by (3.18) and (3.21). partitioning can introduce 
significant reductions in the associated roundoff. Due to the 
substructuring process, such roundoff is somewhat zonally 
contained . 

Such roundoff containment will have very significant impact 
on computer hardware. Specifically, to contain roundoff, many 
significant places must be carried for each number stored. This 
severely impacts memory data transfer, arthimetic operations, 
etc. By reducing roundoff, the hierarchical scheme can increase 
usable memory as well as simplify arthimetic and data transfer 
and storage operations. 
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4. HIERARCHICAL PRECONDITIONING 

To enable a further enhancement of iterative schemes, often 
times preconditioning is employed to associated matrix equation. 
This idea has been commonly used [4.5. 7,8,9] in conjunction with 
iterative method such as conjugate gradient and symmetric 
successive overrelaxation (SSOR) methods at the global level. In 
the context of the hierarchical methodology developed herein, 
separate individualized preconditioning can be applied to each 
distinct local substructures. In this context, the appropriate 
local characteristics can be taken into account. In what 
follows, the concept of hierarchical preconditioning will be 
developed. For demonstration purposes, the preconditioning for 
the CG iterative method (PCG) will be discussed since it has a 
number of attractive properties such as 

i) it does not require any estimation of iteration 
parameter , 

ii) it takes advantage of the distribution of the 
eigenvalues of the iteration operation. 

iii) it requires fewer restrictions on the matrix for 
optimal behavior than does such methods as the SOR 
me thod . 

For the sake of notational convenience, let us assume that 
the system of linear equations corresponding to a given 
substructure are defined by 
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A x 


b 


(4.1) 


where A is a m x m positive definite Hermitian matrix. Let 
A = H - N be an incomplete factorization of A [10] such that N is 

the error matrix and 

M = L U (4.2) 

where L and U are lower and upper triangular matrices and M is 
nonsingular . Note L and U are selected so as to possess 
approximately the same sparse structure as the original matrix. 
This is in constrast to direct factorization which yields densely 
filled subdiagonal with the upper bounding bandwidth. Such a 
choice usually yields improved convergence characteristics. 

The error matrix 

N = M - A (4.3) 

is acceptable if it reduces the spectral condition number which 
is the ratio of the extreme eigenvalues of (LU) A. Note 
reductions in the condition number enhance the rate of 
convergence of the conjugate gradient method. Before giving the 
error estimates, we give the algorithm for the PCG applied to 
(4.1) [4.5]. 
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PCG Algorithm. 


Le 


t x (0) be a given vector and arbitrary define the vector 


,C-1) 


For k = 0.1.2,... 
(a) solve 


L U z^ = b - A x^ 


;( k ) L U 


(b) compute b k = y 2 (k-TT 


. k * 1 


b = 0. „< k ’ = *(“> + b kq < k -') 


z ( k ) / l U z 


(k) 


(c) compute a. 


: q < k >' A .,< k > 

ow **+ 

x C k+1 ) = x< k > + a k q< k > 


For the model problem, there are several ways to obtain the 
incomplete factorization [4.7,10] of A in the form 


A = LL - N. 


(4.4) 


We will use a particular one given by Krishna [4] since it is 
inexpensive and more stable. The outline is given below. 


69 




Let us represent the graph of nonzero entries of L corresponding 
to (i.j) mesh point by 

L : (i.J) 

S i-l.j V i.j 

(i. j-1) 

*i . j-1 
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T 

Then the nonzero entries of LL are given by the graph 


T 

LL 1 : 



v . . , t . . , 

i . J -1 1 . J -1 


s . . , t . . , 

1 . J-1 1 . J-1 


We define v. . . s. . and t by 

1 t J ^ * J 


v . . 

i . J 


= / 


s . = 

i . J 


t . . 

i . j 


b. . 
i . J 

±jJ 


2 

s . , . 

i-l . J 


t 2 . . 

i . J-1 


V i . J 

1±l± 

V i. j 


(4.4) 


It has been shown in [4] that v ± . > °. We remark that in the 
error matrix N we have at most two non-zero entries in each row. 
In step (a) of conjugate gradient algorithm, we can obtain the 
unknown vector z^^^ very easily by using back and forward 

subs t i tut ion . 
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For the error estimate let the weighted error function after 
(£ + 1) iterations be given by 


e(xt‘ +l >) = i(x - x“ +I >) A(« - x 

Then it is known [4] that 


(« + l) 


) ( 4 . 5 ) 


, (£ + 1) 

e( ^ A-l 2 ^ +1) 

£ 4 ~) 


e(x^°^ ) 


+ 1 


(4.6) 


' -1 

where k is the spectral norm of (LL ) A. 


5. BENCHMARKING 

In Part I of this series, a hierarchically parallel 
modelling methodology was developed. Overall, a given problem is 
partitioned into a number of separate substructure each with its 
own distinct internal and interconnecting conductivity matrices. 
In a parallel computer environment, these can be formulated 
simultaneously. As has been seen, the solution to the problem 
then contains at least two phases, i.e. the local and global 
wherein each substructure can be evaluated in a distinct 
processor. The solution to the resulting formulation would then 
be achieved either by: 

i) static condensation and back substitution locally and 
global assembly and direct solution of the resulting 
statically reduced global matrix; 
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ii) static condensation and back-substitution locally and 
global assembly and iterative solution of statically 
reduce matrix globally; 

iii) statically reduce and back-substitute certain chosen 
local substructure while iteratively solving others in 
conjunction with global level iterations and 

iv) perform local parallel iterations approximately 
sequenced with the global phase of iteration. 

For problems with poor conditioning, i.e. large spectral 
radius, method i) would yield the most stable results. In the 
case that certain substructure are well-conditioned spectrally, 
then methods ii) and iii) would be of greater advantage. Lastly 
if all the various subs true tural partitions are spectrally 
well-conditioned, then method iv) would be of greatest advantage 
This follows from the fact that the amount of data flow between 
substrucure and the global level is reduced in the flow of 
calculations associated with the iterative process. Rather than 
whole matrices, pre and post multiplication reduces most of the 
data flow to vector form. Secondly, since the iteration process 
preserves the original matrix sparcity, the storage requirements 
are significantly reduced. Last, but not least, the direct 
method suffer from the fact that as problem size grows, roundoff 
is significantly accelerated. Hence, it requires more 
significant places thereby further reducing available machine 
storage . 
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To benchmark the hierarchical procedure described earlier, 
the following example problem was chosen namely* 

i) For all o < x < 1, 0<y<l, 

v 2 T = 2[(5 + x + y) sin(x + y) - 2 cos (x + y)] (5.1) 

v 2 ( ) = —2 ( ) + *~Z ( ) (5 - 2 ’ 

a-K a y 

ii) For all 0<x< 1. y = 0 or 1; 

T = (5 + x + y) sin(x + y) (5.3) 

iii) For all 0 < y < 1 . x = 0 or 1 ; 

T = (5 + x + y) sin (x + y) (5.4) 

During the evaluation phase, coarse to very refined FD 
meshes were tested. The most refined model consisted of some 
250.000 mesh points representing a like number of total 
equations. Program development and testing was initiated on an 
IBM 3090-200 with a vector facility. To enable the largescale 
evaluations, the program was migrated to a CRAY XMP with an 8 
megaword memory, i.e.. The University of Pittsburgh machine. 
While actual parallel processing was not possible (except for 
four partitions, i.e. CRAY limitations) the overall scheme was 
run sequentially, i.e. local phase requisite assembly, data flow 
and global phase. 
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Three types of solution procedure were considered, namely: 

i) direct global as well as local, i.e. 
condensat ion-backsubs ti tu tion locally and direct 
solution of reduced global matrix; 

ii) locally direct and iterative globally and lastly; 

iii) local iteration and global iteration. 

Note all the local iterations were performed employing the 
preconditioned conjugate gradient methodology [5] noted earlier 
Here we recall that the preconditioner preserved the five 
diagonal form of the various subs true tural blocks making up the 
partition. 

Note for all test cases involving either global on local- 
global iterations, an initial guess was employed along the 
boundaries of the various substructure. The guess was obtained 
by employing a course mesh to generate a preliminary solution. 
This was then extrapolated/interpolated onto the boundaries of 
the substructure, i.e. on their associated boundary mesh points. 

As will be seen, such an approach greatly improves the 
numerical efficiency of the iterative scheme. This follows for 
several reasons, namely: 

i) if direct inversion is used at the local level, the 
roundoff error associated with the backsubs t i tu t i on 
procedure tends to destabilize the global iteration 
process for very largescale problems. In this context 
seeding the solution with a reasonable initial guess 
tends to limit the roundoff; and 
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ii) if iterative techniques are employed at the local 
level, say the conjugate gradient, the convergence 
process is significantly speeded up; this is a natural 
consequence of the fact that such schemes tend to 
converge more rapidly in small neighborhoods of the 
solution. 

Based on the foregoing, the main thrust of the benchmarking 
is several fold, i.e.: 

1) to establish the feasibility of the hierarchical scheme 
to handle large scale simulations in a parallel 
setting; 

(2) to compare the direct, mixed and fully iterative 
versions of the strategy; and 

3) to compare the parallel and traditional nonparallel 
solution algorithms. 

Note, the main purpose of the comparisons between the various 
parallel schemes and the traditional nonparallel approach is to 
ascertain whether any improvements in convergence rate, stability 
storage requirements and run times are obtained. 

For the present purposes, convergence is ascertain by 
employing the normed ratio test. In terms of the hierarchical 
partitioning of T we recall that 
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where T t denotes all the internal points of the substructures and 

T__ and Tw„ correspond to the points on the double and multiple 

boundary^ respectively. 

The ratio test takes the form 


'I In+1 - In H 2 

< TOL. (5.6) 

Uln+lM* 


To quantify which of the various subs tructural partitions are 
encountering convergence difficulties, local checks can be 
undertaken. This is achieved through the use of the expression 




< TOL 


(5.7) 


such that i e[l, L] where L denotes the number of subs tructural 
parti tions . 

Based on the foregoing. Tables 5.1 and 5.2 illustrate 
various aspects of the convergence requirements of the mixed and 
purely iterative schemes. For instance, considering the case of 
local direct calculations. Table 5.1 illustrates the iterative 
requirements. As can be seen, the requirements remain 
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essentially stable inspite of the rather dramatic increases in 
problem size. For the four subdomain problems considered in the 
parallel mode, the overall running time was essentially 1/3 that 
of the sequential version of the partitioning. The improvement 
^ 3 S a direct outgrowth of the parallel calculation of the local 
inverses. While in theory, an even better improvement should 
have been afforded by the parallelism, the recurrent 
backsubs t i tu t i on and overall overhead due to data flow increased 
the time requirements. 

Note, the locally iterative version of the partitioned 

methodology represents a significantly smaller storage burden 

over the purely direct approach. In particular, for a variable 

property, problem defined by an (n.n) square region. 

decomposition into (m.m) partitions reduces the storage 

requirements by a factor of 0(l/m) such that the total is 

3 

proportional to the ratio 0{n /m). In the case of uniform 

3 

properties, the reduction is proportional to 0(l/m ) where in the 
total storage is 0(n /m ). For the four partition benchmark 
problem just considered, the storage needs are essentially 1/8 
that of the straight global approach. This significant reduction 
enabled the running of even the largest problem (250,000) in the 
core of the 8 megaword CRAY. Such storage efficiencies enable 
the maximized usage of the highspeed core. As is well known, 
once secondary storage is required, i.e. , hard and solid state 
disks, the resulting out of core solution, is generally very 
expensive . 
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Table 5.2 illustrates the convergence requirements of the 
totally iterative scheme. Seen graphically in fig. 5.1, it 
follows quite surprisingly that for the given benchmark problem, 
that proportionately less iterative burden is encountered as 
problem size was increased. In particular, as can be seen, the 
problem size to iteration count is a softening curve. Beyond the 
improved numerical efficiency, the parallel methodology enabled 
the solution of problems whose size yield either unstable or 
significantly less efficient iterative processes. In this 
context, the use of hierarchical parallelism: 

1) significantly improves the stability of the iterative 
approach : 

2) reduced computational time due to the capability of the 
procedure to perform simultaneous calculations, i.e. 
the numerical effort for the four partition test care 
was essentially 1/3 that of the full formulation per 
iteration; and 

3) improved iterative efficiency. 

Note, the improved iterative efficiency is a direct 
outgrowth of the partition size reduction introduced by the use 
of parallelism. Additionally, it should be noted that if the 
proper ratio between internal substructural and boundary mesh 
points is obtained, the overhead associated with the data flow 
between levels can be significantly minimized. 
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In comparing the mixed and completely iterative schemes, it 
should be noted that 

£ q i» very large scale problems, the mixed method is 
somewhat sensitive to roundoff error; this is an 
outgrowth of the direct inversion employed at the 
various local level substructural partitions; for large 
scale partitions the number of calculations performed 
during the forward elimination and backsubsti tution 
phases of calculation tax the place accuracy of even 
the CRAY system; 

2) the direct inversion phase of inversion is inherently 
more storage intensive than the iterative scheme; 

3) for spectrally ill-conditioned partitions, the direct 
method can prototypically bypass problems of iterative 
efficiency and stability; 

4) due to roundoff errors, the mixed methods are somewhat 
more sensitive to inaccuracies in the starting guess 
defines along substructural boundaries. This follows 
directly from the fact that roundoff initially 
generated in the foward step and continuously in the 
backward phase act to disturb the overall convergence 
process; such behavior is clearly demonstrated by the 
fact that modest changes in the initial guess accuracy 
(10%) causes major increases in the iteration count of 
the mixed method wherein the global level is iterative 
while the local is direct. 
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6. CONCLUDING REMARKS 

Parts I and II of this series of papers has developed a 
hierarchically parallel modelling methodology and associated 
solution procedure. Overall, the procedure enables a logical and 
efficient use o-f parallel computer environments. The scheme 
provides a wide variety of solution procedures including direct, 
mixed, direct-iterative and completely iterative type procedures. 
Note, due to the local partitioning afforded by the parallelism, 
the overall stability and efficiency of the iterative phases of 
computation are greatly enhanced over the classical full single 
level modelling approach. As has been seen in this part of the 
series, such behavior has been both formally and empirically 
ver i f ied . 

Note, due to the manner of organizing the scheme, it can be 
directly incorporated in conjunction with a wide variety of 
general purpose FD codes for example CINDA [12]. Such an 
undertaking would reduce a code like CINDA to a subroutine 
residing at a given parallel processor. Here it would generate 
the appropriate governing FD equations for the given 
substructure. These would then be locally solved either 
iteratively or if ill-conditioned directly. The upper solution 
would be established via a global level which performs the task 
of overall problem assembly, direct or iterative solution as well 
as data transfer between iterations and among the various 
substructual components. In such an undertaking, the main task 
would be to develop the upper level code. 
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Note in future activities, the current approach is being 
generalized for use in finite element type applications. Work is 
also ongoing to adapt the methodology to use in nonlinear 
pr ob 1 ems . 
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Table 5. 


Problem Size 

h = .005 
39601 DOF 


h = .003125 
101761 DOF 


h = .00270 
136161 DOF 


DOF - Degrees 


Table 5.2 


Problem Size 


h = .01 
9801 DOF 


h = .005 
39601 DOF 


h = .003125 
101761 DOF 


h = .0025 
159201 DOF 


h = .002 

249001 DOF 


- Iteration of Requirments of Mixed Direct-Iterative 
Hierarchical Scheme. 


Tolerance 



No. of Iterations 
113 


10 


-6 


118 


10 


-6 


118 


of Freedom 


- Requirements of Purely 


T o 1 erance 
10- 5 


10 


-5 


10 


-5 


10 


-5 


10 


-5 


Iterative Hierarchical Scheme 

No. of Iterations 
31 

65 

92 

117 

150 
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